In Exploratory Data Analysis of tabular data, univariate analysis is the first step. It consists of exploring, summarizing, and visualizing individual columns of a dataset. In this workbook we focus on univariate numerical samples. We explore techniques for:
- Summarizing univariate numerical samples
- Displaying numerical samples
This is also an opportunity to:
Setup
If the required packages have not (yet) been installed, install them.
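A minimal sketch of a conditional install. The workbook does not list the required packages, so the vector `required` below is an assumption, and `missing_packages()` is a hypothetical helper name.

```r
# Hypothetical package list: adjust to the packages this workbook actually needs.
required <- c("ggplot2", "dplyr")

# Return the subset of `pkgs` that cannot be found in the current library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)

# Install only in an interactive session, to avoid surprise downloads in scripts.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Checking first keeps repeated runs of the workbook from reinstalling packages that are already available.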
General Social Survey (GSS) dataset
Load the cumulative GSS dataset (gss_all). Have a glimpse at the resulting dataframe. Load gss_dict.
- In dataset gss_all, what do the rows stand for?
- In dataset gss_all, what do columns year and id stand for?
- For a given value of id, can you find several rows?
- For a given value of year, can you find several rows with the same id?
- How many distinct values of year can you find in gss_all?
- For each value of year, how many people were surveyed?
- Why is this dataset called cumulative?
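The questions above can be explored with a few base R calls. The sketch below runs on a made-up data frame standing in for gss_all: only the column names year and id come from the dataset, and the values are invented.

```r
# Toy stand-in for gss_all (invented values, three survey years).
toy <- data.frame(
  year = c(1972, 1972, 1972, 1973, 1973, 1974),
  id   = c(1, 2, 3, 1, 2, 1)
)

# Number of distinct survey years.
n_years <- length(unique(toy$year))

# Number of rows (respondents) per survey year.
per_year <- table(toy$year)

# Number of rows sharing each value of id, all years pooled.
rows_per_id <- table(toy$id)

# TRUE if some (year, id) pair occurs more than once.
dup_within_year <- anyDuplicated(toy[c("year", "id")]) > 0
```

Running the same calls on gss_all answers the questions for the real data. The toy values say nothing about what the answers will be.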
Table exploration
Load gss_sub which is much smaller than gss_all. Have a glimpse.
- How many variables can you find in gss_sub?
- How many distinct values are there in each column?
- Which columns should be considered as categorical/factor?
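One way to approach these questions is to count distinct non-missing values per column. The sketch below uses an invented data frame in place of gss_sub. The cutoff of two distinct values used to flag categorical candidates is an arbitrary assumption for this toy example, not a rule.

```r
# Toy stand-in for gss_sub (invented values).
toy <- data.frame(
  year = c(1972, 1972, 1980, 1980, 1990),
  age  = c(23, 45, 45, 67, NA),
  sex  = c("Female", "Male", "Female", "Female", "Male")
)

# Number of variables (columns).
n_vars <- ncol(toy)

# Number of distinct non-missing values in each column.
n_distinct <- vapply(toy, function(x) length(unique(x[!is.na(x)])), integer(1))

# Columns with very few distinct values are candidates for factors
# (the threshold 2 is chosen for this toy example only).
candidates <- names(n_distinct)[n_distinct <= 2]
```

A low count of distinct values is only a hint. Whether a column is categorical ultimately depends on what it measures.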
In the sequel, we explore the age distribution as if the age column were a genuine univariate sample. This is done for teaching purposes. The age column is not collected by repeatedly picking individuals uniformly at random from a fixed population.
Indeed, the age column is a union of samples collected every year or every two years since 1972. The American population has changed throughout the five decades.
Moreover, yearly samples are not i.i.d. samples from the whole population. The sampling methods have varied over time. Sampling methods rely on multistage stratified sampling and quotas.
Exploring age column
For column age, disregarding any weighting process:
- compute the summary.
- compute the range, the IQR, the standard deviation
- compute the Mean Absolute Deviation, the Median Absolute Deviation
Filter out rows with missing data in columns age or sex
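The statistics listed above are all available in base R. The sketch below uses a small invented sample rather than the GSS data. Note that `mad()` rescales the median absolute deviation by 1.4826 by default. Pass `constant = 1` to get the raw value.

```r
# Toy sample standing in for the age and sex columns (invented values).
toy <- data.frame(
  age = c(18, 25, 31, 40, 52, 67, 89, NA, 44),
  sex = c("Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", NA)
)

# Keep rows where neither age nor sex is missing.
toy_ok <- toy[complete.cases(toy[c("age", "sex")]), ]
age <- toy_ok$age

age_summary <- summary(age)            # min, quartiles, mean, max
age_range   <- diff(range(age))        # max - min
age_iqr     <- IQR(age)                # third quartile - first quartile
age_sd      <- sd(age)                 # sample standard deviation (n - 1 denominator)

# Mean absolute deviation around the mean.
mean_abs_dev <- mean(abs(age - mean(age)))

# Median absolute deviation around the median, without the 1.4826 rescaling.
median_abs_dev <- mad(age, constant = 1)
```

With dplyr, the same filtering step could be written with `filter(!is.na(age), !is.na(sex))`.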
Boxplots
- Build a boxplot for age.
- Equip the plot with a title, a subtitle, a caption
- Annotate the boxplot with summary statistics.
- Build a boxplot of age distribution according to sex.
- What is the impact of argument varwidth=T?
- What is the impact of argument notch=T?
- What is the difference between stat_boxplot() and geom_boxplot()?
- How would you get rid of the useless ticks on the x-axis?
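A possible starting point for these boxplots, assuming ggplot2 is installed. The data are simulated and only mimic the shape of the filtered GSS subsample. The titles, captions and annotation position are placeholders.

```r
library(ggplot2)

# Simulated stand-in for the filtered sample (not real survey data).
set.seed(42)
toy <- data.frame(
  age = round(runif(200, min = 18, max = 89)),
  sex = rep(c("Female", "Male"), times = c(120, 80))
)

# Quartiles used to annotate the single boxplot.
q <- quantile(toy$age, probs = c(0.25, 0.5, 0.75))

# One box for the whole sample. The x aesthetic is a dummy constant,
# so its tick and label carry no information and are removed in theme().
p_age <- ggplot(toy, aes(x = "", y = age)) +
  geom_boxplot() +
  annotate("text", x = 1.45, y = unname(q),
           label = paste(names(q), q, sep = ": "), hjust = 0) +
  labs(title = "Age distribution",
       subtitle = "Simulated sample",
       caption = "Toy data, for illustration only",
       x = NULL) +
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())

# One box per sex. varwidth scales box widths with the square root of the
# group sizes; notch draws notches around the medians.
p_sex <- ggplot(toy, aes(x = sex, y = age)) +
  geom_boxplot(varwidth = TRUE, notch = TRUE) +
  labs(title = "Age distribution by sex", subtitle = "Simulated sample")
```

Printing `p_age` or `p_sex` renders the plots. Toggling `varwidth` and `notch` one at a time makes their effect easy to see.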
Histograms
- Plot a histogram of the age distribution
- Facet by sex
- Draw the age distribution histograms for each sex on the same plot
- Facet by sex and year
- Build an animated histogram plot where frame is determined by year
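The static variants can be sketched as follows, assuming ggplot2 is installed. The data are simulated, with three invented survey years. The bin width of 5 is an arbitrary choice. The animated variant is not covered here.

```r
library(ggplot2)

# Simulated stand-in for the GSS subsample (not real survey data).
set.seed(42)
toy <- data.frame(
  age  = round(runif(300, min = 18, max = 89)),
  sex  = sample(c("Female", "Male"), 300, replace = TRUE),
  year = sample(c(1972, 1990, 2010), 300, replace = TRUE)
)

# Plain histogram with bins of width 5 aligned on multiples of 5.
p_hist <- ggplot(toy, aes(x = age)) +
  geom_histogram(binwidth = 5, boundary = 0)

# One panel per sex.
p_facet <- p_hist + facet_wrap(~ sex)

# Both sexes on the same panel: position = "identity" overlays the bars
# instead of stacking them, and alpha keeps both histograms visible.
p_overlay <- ggplot(toy, aes(x = age, fill = sex)) +
  geom_histogram(binwidth = 5, boundary = 0, position = "identity", alpha = 0.5)

# One panel per (year, sex) combination.
p_grid <- p_hist + facet_grid(year ~ sex)
```

Without `position = "identity"`, `geom_histogram()` stacks the bars of the two groups, which shows the pooled distribution rather than a comparison.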
Histograms are used to sketch possibly (absolutely) continuous distributions by using piecewise constant approximations of density functions. Histograms can also be viewed as column plots for binned data (that is discretizations of “continuous” data).
- Define breaks for age data
  - regular breaks with age ranges of length 5
  - irregular breaks [18, 25[, [25, 35[, [35, 50[, [50, 65[, [65, +∞[
- Bin age according to defined breaks using cut()
- Plot the binned data using geom_bar() or geom_col()
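Binning with `cut()` can be sketched on a handful of invented ages. The intervals above are closed on the left and open on the right, which corresponds to `right = FALSE`. Starting the regular grid at 15 is an assumption made so that age 18 falls inside the first bin.

```r
# Invented ages, chosen to sit on or near the break points.
age <- c(18, 24, 25, 49, 65, 89)

# Regular breaks: bins of length 5 covering [15, 90).
regular_breaks <- seq(15, 90, by = 5)
age_regular <- cut(age, breaks = regular_breaks, right = FALSE)

# Irregular breaks: [18, 25), [25, 35), [35, 50), [50, 65), [65, Inf).
irregular_breaks <- c(18, 25, 35, 50, 65, Inf)
age_irregular <- cut(age, breaks = irregular_breaks, right = FALSE)

# Counts per bin, empty bins included.
counts_irregular <- table(age_irregular)
```

The result of `cut()` is a factor, so it can be mapped directly to the x aesthetic of `geom_bar()`, or counted first and passed to `geom_col()`.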
Demographers use population pyramids to sketch the age distribution in a population. Population pyramids are special faceted histograms or barplots.
- Plot an age-sex pyramid for the gss sample.
- Animate with respect to year
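One common way to draw a pyramid is to count respondents per age group and sex, flip the sign of the counts for one sex, and draw horizontal bars. The sketch below assumes ggplot2 is installed and uses simulated data. Putting men on the left is a convention chosen here, not something imposed by the data.

```r
library(ggplot2)

# Simulated stand-in for the GSS sample (not real survey data).
set.seed(42)
toy <- data.frame(
  age = round(runif(300, min = 18, max = 89)),
  sex = sample(c("Female", "Male"), 300, replace = TRUE)
)

# Five-year age groups, closed on the left.
toy$age_group <- cut(toy$age, breaks = seq(15, 90, by = 5), right = FALSE)

# Counts per (age group, sex), as a data frame with a Freq column.
counts <- as.data.frame(table(age_group = toy$age_group, sex = toy$sex))

# Negative counts push the bars for men to the left of zero.
counts$signed <- ifelse(counts$sex == "Male", -counts$Freq, counts$Freq)

p_pyramid <- ggplot(counts, aes(x = age_group, y = signed, fill = sex)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = abs) +   # show counts as positive numbers
  labs(x = "Age group", y = "Count", title = "Age-sex pyramid (simulated data)")
```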
Density plots
Histograms deliver piecewise constant estimations/approximations of a population density. If we suspect the population density to be smooth, it is sensible to try to build smooth estimates/approximations of the population density. This is the purpose of density estimates.
- Draw density plots for age distribution
- Use different bandwidths
- Use different kernels
- Facet by sex
- Facet by sex and year
- Overlay histograms and density plots (in geom_histogram() use aes(y=after_stat(density)))
Build violin plots for age distribution (use geom_violin()).
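The effect of bandwidth and kernel can be explored with base R `density()` before moving to ggplot2. The sketch below uses simulated ages, and the bandwidths 1 and 10 are arbitrary choices. `geom_density()` accepts `bw` and `kernel` arguments with the same meaning.

```r
# Simulated ages (not real survey data).
set.seed(42)
age <- round(runif(500, min = 18, max = 89))

# Default: Gaussian kernel, bandwidth picked by a rule of thumb (bw.nrd0).
d_default <- density(age)

# Smaller bandwidth: wigglier estimate. Larger bandwidth: smoother estimate.
d_narrow <- density(age, bw = 1)
d_wide   <- density(age, bw = 10)

# Same default bandwidth, different kernel.
d_epan <- density(age, kernel = "epanechnikov")

# Sanity check: a density estimate should integrate to roughly 1.
# The grid in d_default$x is equally spaced, so a Riemann sum is enough.
area <- sum(d_default$y) * diff(d_default$x[1:2])
```

Each object can be drawn with `plot()`, and further estimates added with `lines()`, to compare the estimates on the same axes.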
Cumulative Distribution Functions
Not all probability distributions have densities, but all are characterized by their Cumulative Distribution Functions (CDFs). Each sample defines an Empirical Cumulative Distribution Function (ECDF).
- Compare the age distributions for women and men using the Kolmogorov-Smirnov statistic (ks.test())
- How is the Kolmogorov-Smirnov statistic computed?
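The two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the two ECDFs. The sketch below computes it by hand on two small invented samples and compares the result with `ks.test()`.

```r
# Invented ages for two groups (no ties, to keep the example simple).
women <- c(21, 34, 47, 58, 62)
men   <- c(19, 25, 30, 41, 70, 83)

# ecdf() returns a function: F_women(x) is the fraction of the sample <= x.
F_women <- ecdf(women)
F_men   <- ecdf(men)

# Both ECDFs are right-continuous step functions that only jump at sample
# points, so the largest gap is attained on the pooled sample.
grid <- sort(unique(c(women, men)))
d_manual <- max(abs(F_women(grid) - F_men(grid)))

# Same statistic as reported by ks.test().
d_ks <- unname(ks.test(women, men)$statistic)
```

Here the largest gap is 1/3, reached at age 62, where all five women but only four of the six men are at or below that age.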
Quantile plots
The quantile function of a probability distribution is the (generalized, left-continuous) inverse of its CDF. Quantile functions are useful devices in EDA and random generation.
- Plot the quantile function of the age empirical distribution
- Plot the quantile functions of the age empirical distributions for men and women
- Design a function that takes as input a univariate numerical sample and returns the quantile function (in the same way as ecdf() does)
- Draw a quantile-quantile plot to compare age distribution for women and men with base R qqplot()
- Draw a quantile-quantile plot to compare age distribution for women and men using ggplot2.
How could you comply with the DRY principle?
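A minimal sketch of a quantile-function factory in the spirit of `ecdf()`. The name `make_quantile_function()` is hypothetical. It implements the left-continuous generalized inverse, which agrees with `quantile(type = 1)` for p in (0, 1]. Returning the sample minimum at p = 0 is a convention adopted here.

```r
# Build the empirical quantile function of a numeric sample.
# For p in (0, 1], Q(p) is the smallest sample value x with ECDF(x) >= p,
# that is, the order statistic of rank ceiling(n * p).
make_quantile_function <- function(x) {
  xs <- sort(x)            # sort() drops missing values by default
  n <- length(xs)
  stopifnot(n > 0)
  function(p) {
    stopifnot(all(p >= 0 & p <= 1))
    # The small tolerance guards against n * p landing a hair above an
    # integer because of floating-point rounding.
    xs[pmax(1, ceiling(n * p - 1e-12))]
  }
}

# Invented ages for two groups.
toy <- data.frame(
  age = c(22, 35, 35, 48, 61, 19, 27, 44, 52, 70, 83),
  sex = rep(c("Female", "Male"), times = c(5, 6))
)

# DRY: build one quantile function per group with a single call.
q_by_sex <- lapply(split(toy$age, toy$sex), make_quantile_function)

# Quantiles of both groups at common probabilities, ready for a QQ plot.
probs <- seq(0.1, 0.9, by = 0.1)
qq <- data.frame(
  women = q_by_sex$Female(probs),
  men   = q_by_sex$Male(probs)
)
```

Plotting `qq$men` against `qq$women` gives a quantile-quantile plot. Because the factory is applied once per group through `split()` and `lapply()`, the same code serves any number of groups.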